
Large language model artificial intelligence applications (LLM AIs) seem poised to have a significant effect on the practice of medicine, both good and bad, which is why we are giving them as much attention as we are here. LLMs give impressive results when tested on medical knowledge, passing multiple-choice exams designed for general medical and specialty certification. In fact, it did not take long for many of the top LLMs to outperform human physicians on such exams.

However – anyone who has used LLMs can likely tell you that LLMs cannot truly think. Understanding the strengths and weaknesses of such models is critical to their incorporation into medicine, whether in education, research, or patient care.

What do I mean when I say they cannot think? They are very sophisticated chatbots able to generate human language, and this creates a facsimile of human thought, but they are not actually sentient and do not have any genuine understanding. Initial answers from an LLM can therefore be impressive – they have all their training data at their disposal, can search through that data and even the live web, and can formulate the results into coherent speech. This powerfully creates the illusion that they understand the question.

But when you dig deeper you can find that some of their answers are complete nonsense. LLMs can “hallucinate”, meaning they make up answers that are not real. These fabrications follow the format of a correct answer (complete with fake but perfectly formatted references) but have no basis in fact. This is fatal to any medical application. Further, if you start to push them with follow-up questions, their answers can begin to break down. They contradict themselves, forget previous instructions, and overall just don’t “understand” the context.

All of this has led medical experts studying the use of AIs in medicine to ask whether we are assessing them in the most meaningful way. They do great on multiple-choice questions, but perhaps that is not the best way to assess their clinical decision-making. Recent studies have therefore focused on more challenging tasks, such as creating an initial differential diagnosis, diagnostic plan, and treatment plan based on a clinical vignette, and then following the entire clinical case through to a final diagnosis and treatment. When tested in this more meaningful way, consistent with the experience I outlined above, LLMs tend to break down and their performance is much less impressive.

“LLMs were tested across 29 clinical vignettes (representing 16 254 responses in total). PrIME-LLM scores ranged from 0.64 (range, 0.63-0.65) (Gemini 1.5 Flash) to 0.78 (range, 0.77-0.79) (Grok 4), with reasoning-optimized models outperforming nonreasoning models and GPT models scoring highest overall. Differential diagnosis was less accurate than diagnostic testing, while final diagnosis, management, and miscellaneous reasoning were more accurate. Failure rates exceeded 0.80 (range, 0.90-1.00) for differential diagnosis in all models but were less than 0.40 (range, 0.09-0.39) for final diagnosis. Multimodal performance was robust; most LLM models showed improved accuracy with image inputs.”

In short, that is not good enough to practice medicine. I taught medicine for 30 years, so I have lots of evidence-informed thoughts about what constitutes clinical competence. We generally break down clinical ability into three categories – competence, expertise, and mastery. Competence basically means that, when given sufficient information about a straightforward case, you know standard diagnostic and treatment protocols. This is about the level you get to when you finish your internship. At this level you know what to do maybe 95% of the time, as long as you don’t get thrown any curveballs – rare or unusual cases or situations.

The conventional wisdom in medicine (not literally true, but helpful for perspective) is that you need 5% of your medical knowledge to handle 95% of cases, and 95% of your medical knowledge to handle the other 5% of cases. What this means is that as cases get progressively more complex and difficult, you need more and more knowledge and experience to successfully manage them. That one really rare and quirky case will push your medical knowledge to its limits.

Expertise basically means you have enough of that knowledge and experience to handle that top 5% of cases in terms of complexity. Interns need to be competent; an expert is someone who has completed their residency, perhaps even subspecialty training, and has practiced long enough to have their own body of experience. Mastery means you can handle the top 0.1% of the most difficult and complex cases – you are a recognized expert in the field.

On that spectrum, LLMs appear to be barely at the competency level, which is not enough to actually practice medicine. LLMs are also good test-takers, but they do significantly worse in anything closer to a clinical setting – precisely because they don’t really think. They cannot filter a patient’s history through an understanding of how people react to and talk about their own illnesses. They are particularly bad at filling in gaps in the information.

Here is another way to look at it – all medical educators know that when we teach our students about patient cases, the biggest challenge is that we are presenting prepackaged cases “tied in a bow”. Someone has already taken all the information from the case and assembled it into a vignette, which means they have already done a lot of filtering. They have made sure critical information is present, and have likely even subconsciously shaped the case knowing the outcome. Such cases are often neat, tidy, and clean, despite our efforts to make them more challenging.

The real world, by contrast, is messy and unpredictable. LLMs do well within controlled environments. They now do great on multiple-choice questions. They do significantly worse when following a case vignette from initial assessment to final diagnosis and plan. I suspect they would do far worse in the uncontrolled environment of an actual clinic or hospital, with lots of noisy and even missing information. These kinds of information-noisy environments tend to produce hallucinations and incoherent errors in LLMs.

What does all this mean? It means that LLMs are not ready, not even close, to doing anything like practicing medicine. But no one is suggesting that they are or should be used that way. It is important for the general public to know that they cannot use a public chatbot as a substitute for sound medical advice. LLMs can, however, make great tools for education (although again with limits) or for use by an expert.

The strengths and weaknesses of AI vs humans are complementary in many ways. Properly trained LLMs have the advantage of thorough knowledge and the ability to sort through vast amounts of information. For a human to acquire this level of knowledge takes years, and they can never truly get there because new information is produced so quickly it’s literally impossible to keep up, except in a very narrow area of expertise. LLMs are therefore great for filling in gaps in human knowledge, and making suggestions that a human doc may have missed. They can also be good at analytical knowledge, knowing in precise detail the statistics on, for example, how a test result affects the probability of a diagnosis.
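To make that last point concrete, here is a minimal sketch of the kind of calculation involved – converting a pre-test probability into a post-test probability using a test’s likelihood ratio. All the numbers are hypothetical, chosen only to illustrate the arithmetic; this is standard Bayesian reasoning, not a method from any particular study.

    # A minimal sketch of Bayesian updating with a likelihood ratio.
    # All numbers below are hypothetical, for illustration only.

    def post_test_probability(pre_test_prob, sensitivity, specificity, positive=True):
        """Update a pre-test probability of disease given one test result."""
        if positive:
            lr = sensitivity / (1 - specificity)    # LR+ for a positive result
        else:
            lr = (1 - sensitivity) / specificity    # LR- for a negative result
        pre_odds = pre_test_prob / (1 - pre_test_prob)  # probability -> odds
        post_odds = pre_odds * lr                       # Bayes' theorem, odds form
        return post_odds / (1 + post_odds)              # odds -> probability

    # Hypothetical test: 90% sensitivity, 85% specificity, 10% pre-test probability.
    print(round(post_test_probability(0.10, 0.90, 0.85, positive=True), 2))   # 0.4
    print(round(post_test_probability(0.10, 0.90, 0.85, positive=False), 2))  # 0.01

In this made-up example, a positive result moves a 10% suspicion to about a 40% probability, while a negative result drops it to about 1% – exactly the kind of precise, mechanical updating that machines do well and that humans tend to estimate poorly.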

But they are terrible at the intuitive aspect of clinical decision-making, the creative aspect, at seeing the whole picture and putting it into a human context, and filling in missing pieces in a way that can make sense out of incomplete and noisy data. Experienced human clinicians can be great at all that. So the two working optimally together can be a powerful team.

The challenge for the medical profession is determining how to work optimally together. This requires a lot more testing of how LLMs behave, and hopefully improving their medical knowledge and reasoning while minimizing problems like hallucinations. But there always needs to be a human clinician in the loop – LLMs do not replace human thought, they augment it.

Will such medical expert systems be used that way, however? Mark Crislip recently wrote here about the worst-case scenario of AI in medical practice. Unfortunately this seems like a highly plausible scenario – one in which the temptation to allow LLMs to think for you is just too great, and new doctors, trained in an era of AI, never develop the ability to think like expert clinicians. We would then be stuck with LLM-level competence, without true expertise. This is the medical version of AI slop.

But my optimistic side says that the medical education institutions will not let this happen. Professionalism and academic integrity will save the day, finding that sweet spot of incorporating AI as a clinical tool without letting it do your thinking for you.

I suspect we will get something in between these two extremes. If I had to guess I would say that the ultimate effect of AI on the practice of medicine will be to exaggerate the existing extremes. Mediocre doctors will become even more mediocre, what marginal skills they may have developed atrophying as they become increasingly a rubber stamp for whatever AI tells them to do. Meanwhile the best clinicians will use AI to become even better, able to leverage the awesome power of AI to quickly get to the critical information they need, and to plug any holes that even the best clinicians have.

The question is – what will the net effect be on the average clinician? I hope it is positive. We will see.


Posted by Steven Novella

Founder and currently Executive Editor of Science-Based Medicine, Steven Novella, MD, is an academic clinical neurologist at the Yale University School of Medicine. He is also the host and producer of the popular weekly science podcast, The Skeptics’ Guide to the Universe, and the author of the NeuroLogicaBlog, a daily blog that covers news and issues in neuroscience, but also general science, scientific skepticism, philosophy of science, critical thinking, and the intersection of science with the media and society. Dr. Novella has also produced two courses with The Great Courses, and published a book on critical thinking - also called The Skeptics’ Guide to the Universe.